This part is our own work: handling Chicago crime data from 2001 to 2017.
Thanks to these projects:
- https://www.kaggle.com/fahd09/eda-of-crime-in-chicago-2005-2016
- Zhu Xiaobo, Li Xin, Ye Xinyue. A multi-dimensional analysis of ordinary theft cases in Chicago in the context of data linkage [J]. Crime Research (犯罪研究), 2018(4):10.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import gc
import math
import random
import folium
from datetime import datetime
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
from sklearn.metrics import r2_score
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation, performance_metrics
from fbprophet.plot import plot_cross_validation_metric, add_changepoints_to_plot, plot_plotly
import json
from fbprophet.serialize import model_to_json, model_from_json
Crime_2001_to_2004 = pd.read_csv("../Data/Crimes in Chicago_An extensive dataset of crimes in Chicago (2001-2017), by City of Chicago/Chicago_Crimes_2001_to_2004.csv", error_bad_lines=False)
Crime_2005_to_2007 = pd.read_csv("../Data/Crimes in Chicago_An extensive dataset of crimes in Chicago (2001-2017), by City of Chicago/Chicago_Crimes_2005_to_2007.csv", error_bad_lines=False)
Crime_2008_to_2011 = pd.read_csv("../Data/Crimes in Chicago_An extensive dataset of crimes in Chicago (2001-2017), by City of Chicago/Chicago_Crimes_2008_to_2011.csv", error_bad_lines=False)
Crime_2012_to_2017 = pd.read_csv("../Data/Crimes in Chicago_An extensive dataset of crimes in Chicago (2001-2017), by City of Chicago/Chicago_Crimes_2012_to_2017.csv", error_bad_lines=False)
b'Skipping line 1513591: expected 23 fields, saw 24\n'
b'Skipping line 533719: expected 23 fields, saw 24\n'
b'Skipping line 1149094: expected 23 fields, saw 41\n'
/home/hanpeng/anaconda3/envs/Thefts_in_Chicago/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3172: DtypeWarning: Columns (17,20) have mixed types. Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
print(Crime_2001_to_2004.columns)
Crime_2001_to_2004.head()
Index(['Unnamed: 0', 'ID', 'Case Number', 'Date', 'Block', 'IUCR',
'Primary Type', 'Description', 'Location Description', 'Arrest',
'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code',
'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude',
'Longitude', 'Location'],
dtype='object')
| | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 879 | 4786321 | HM399414 | 01/01/2004 12:01:00 AM | 082XX S COLES AVE | 0840 | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | False | ... | 7.0 | 46.0 | 06 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 1 | 2544 | 4676906 | HM278933 | 03/01/2003 12:00:00 AM | 004XX W 42ND PL | 2825 | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | ... | 11.0 | 61.0 | 26 | 1173974.0 | 1876757.0 | 2003.0 | 04/15/2016 08:55:02 AM | 41.817229 | -87.637328 | (41.817229156, -87.637328162) |
| 2 | 2919 | 4789749 | HM402220 | 06/20/2004 11:00:00 AM | 025XX N KIMBALL AVE | 1752 | OFFENSE INVOLVING CHILDREN | AGG CRIM SEX ABUSE FAM MEMBER | RESIDENCE | False | ... | 35.0 | 22.0 | 20 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 3 | 2927 | 4789765 | HM402058 | 12/30/2004 08:00:00 PM | 045XX W MONTANA ST | 0840 | THEFT | FINANCIAL ID THEFT: OVER $300 | OTHER | False | ... | 31.0 | 20.0 | 06 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 4 | 3302 | 4677901 | HM275615 | 05/01/2003 01:00:00 AM | 111XX S NORMAL AVE | 0841 | THEFT | FINANCIAL ID THEFT:$300 &UNDER | RESIDENCE | False | ... | 34.0 | 49.0 | 06 | 1174948.0 | 1831051.0 | 2003.0 | 04/15/2016 08:55:02 AM | 41.691785 | -87.635116 | (41.691784636, -87.635115968) |
5 rows × 23 columns
When trying to read the data with
Crime_2001_to_2004 = pd.read_csv("../Data/Crimes in Chicago_An extensive dataset of crimes in Chicago (2001-2017), by City of Chicago/Chicago_Crimes_2001_to_2004.csv")
we got an error like:
ParserError: Error tokenizing data. C error: Expected 23 fields in line 1513591, saw 24
The fix is:
Crime_2001_to_2004 = pd.read_csv("../Data/Crimes in Chicago_An extensive dataset of crimes in Chicago (2001-2017), by City of Chicago/Chicago_Crimes_2001_to_2004.csv", error_bad_lines=False)
So now it works:
b'Skipping line 1513591: expected 23 fields, saw 24\n'
/home/hanpeng/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3165: DtypeWarning: Columns (17,20) have mixed types. Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
The reason: when a line's field count doesn't match the header row's, read_csv raises this error. If those malformed lines can safely be ignored, error_bad_lines=False skips them instead.
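Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions the equivalent is on_bad_lines="skip". A minimal sketch on an in-memory CSV (the data here is made up for illustration):

```python
from io import StringIO

import pandas as pd

# A small CSV with one malformed row (4 fields where the header declares 3).
csv_text = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines="skip" drops malformed rows instead of raising ParserError.
df = pd.read_csv(StringIO(csv_text), on_bad_lines="skip")
print(df.shape)  # (2, 3): the malformed row was skipped
```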
Combine the four files' data into one DataFrame.
gc.collect() is called to guarantee the freed RAM is reclaimed.
Crime_data = pd.concat([Crime_2001_to_2004, Crime_2005_to_2007, Crime_2008_to_2011, Crime_2012_to_2017])
del Crime_2001_to_2004
del Crime_2005_to_2007
del Crime_2008_to_2011
del Crime_2012_to_2017
gc.collect()
0
Crime_data.head()
| | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 879 | 4786321 | HM399414 | 01/01/2004 12:01:00 AM | 082XX S COLES AVE | 0840 | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | False | ... | 7.0 | 46.0 | 06 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 1 | 2544 | 4676906 | HM278933 | 03/01/2003 12:00:00 AM | 004XX W 42ND PL | 2825 | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | ... | 11.0 | 61.0 | 26 | 1173974.0 | 1876757.0 | 2003.0 | 04/15/2016 08:55:02 AM | 41.817229 | -87.637328 | (41.817229156, -87.637328162) |
| 2 | 2919 | 4789749 | HM402220 | 06/20/2004 11:00:00 AM | 025XX N KIMBALL AVE | 1752 | OFFENSE INVOLVING CHILDREN | AGG CRIM SEX ABUSE FAM MEMBER | RESIDENCE | False | ... | 35.0 | 22.0 | 20 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 3 | 2927 | 4789765 | HM402058 | 12/30/2004 08:00:00 PM | 045XX W MONTANA ST | 0840 | THEFT | FINANCIAL ID THEFT: OVER $300 | OTHER | False | ... | 31.0 | 20.0 | 06 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 4 | 3302 | 4677901 | HM275615 | 05/01/2003 01:00:00 AM | 111XX S NORMAL AVE | 0841 | THEFT | FINANCIAL ID THEFT:$300 &UNDER | RESIDENCE | False | ... | 34.0 | 49.0 | 06 | 1174948.0 | 1831051.0 | 2003.0 | 04/15/2016 08:55:02 AM | 41.691785 | -87.635116 | (41.691784636, -87.635115968) |
5 rows × 23 columns
print('Crime_data loaded!')
Crime_data loaded!
Crime_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7941282 entries, 0 to 1456713
Data columns (total 23 columns):
 #   Column                Dtype
---  ------                -----
 0   Unnamed: 0            int64
 1   ID                    int64
 2   Case Number           object
 3   Date                  object
 4   Block                 object
 5   IUCR                  object
 6   Primary Type          object
 7   Description           object
 8   Location Description  object
 9   Arrest                bool
 10  Domestic              bool
 11  Beat                  int64
 12  District              float64
 13  Ward                  float64
 14  Community Area        float64
 15  FBI Code              object
 16  X Coordinate          float64
 17  Y Coordinate          object
 18  Year                  float64
 19  Updated On            object
 20  Latitude              object
 21  Longitude             float64
 22  Location              object
dtypes: bool(2), float64(6), int64(3), object(12)
memory usage: 1.3+ GB
print("Types of crime: \n", np.array(Crime_data['Primary Type'].drop_duplicates()))
Types of crime: ['THEFT' 'OTHER OFFENSE' 'OFFENSE INVOLVING CHILDREN' 'CRIM SEXUAL ASSAULT' 'MOTOR VEHICLE THEFT' 'SEX OFFENSE' 'DECEPTIVE PRACTICE' 'BATTERY' 'BURGLARY' 'WEAPONS VIOLATION' 'PUBLIC PEACE VIOLATION' 'NARCOTICS' 'GAMBLING' 'PROSTITUTION' 'LIQUOR LAW VIOLATION' 'INTERFERENCE WITH PUBLIC OFFICER' 'CRIMINAL DAMAGE' 'ASSAULT' 'STALKING' 'ARSON' 'CRIMINAL TRESPASS' 'HOMICIDE' 'ROBBERY' 'OBSCENITY' 'KIDNAPPING' 'INTIMIDATION' 'RITUALISM' 'DOMESTIC VIOLENCE' 'OTHER NARCOTIC VIOLATION' 'PUBLIC INDECENCY' 'NON-CRIMINAL' 'HUMAN TRAFFICKING' 'CONCEALED CARRY LICENSE VIOLATION' 'NON - CRIMINAL' 'NON-CRIMINAL (SUBJECT SPECIFIED)']
print("Location Type: \n", np.array(Crime_data['Location Description'].drop_duplicates().head()))
Location Type: ['RESIDENCE' 'OTHER' 'APARTMENT' 'RESIDENCE PORCH/HALLWAY' 'GAS STATION']
What makes two cases duplicates? Group by Case Number and count:
pd.DataFrame(Crime_data).groupby(['Case Number']).count()
| Case Number | Unnamed: 0 | ID | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01G050460 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 03J493690 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 0 |
| 04X245238 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 07C115980 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 07HN36467 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZZ572583 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ZZ591134 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ZZ696090 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ZZ740108 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ZZZ199957 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
6170472 rows × 22 columns
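To eyeball the duplicated pairs directly before dropping them, duplicated(keep=False) marks every member of a duplicated group. A sketch on a tiny made-up frame standing in for Crime_data:

```python
import pandas as pd

# Made-up miniature of Crime_data: two rows share a Case Number.
toy = pd.DataFrame({
    "Case Number": ["HM100", "HM100", "HM200"],
    "Primary Type": ["THEFT", "THEFT", "BURGLARY"],
})

# keep=False flags *every* row of a duplicated group, so the pairs can be
# inspected side by side before deciding to drop them.
dupes = toy[toy.duplicated("Case Number", keep=False)]
print(len(dupes))  # 2
```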
The counts look convincing: duplicates share a Case Number. So let's drop them, keeping one row per Case Number:
print("Before drop_duplicates: ", Crime_data.shape)
Crime_data.drop_duplicates(subset=['Case Number'], inplace=True)
print("After drop_duplicates: ", Crime_data.shape)
Before drop_duplicates: (7941282, 23) After drop_duplicates: (6170473, 23)
Crime_data.head()
| | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 879 | 4786321 | HM399414 | 01/01/2004 12:01:00 AM | 082XX S COLES AVE | 0840 | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | False | ... | 7.0 | 46.0 | 06 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 1 | 2544 | 4676906 | HM278933 | 03/01/2003 12:00:00 AM | 004XX W 42ND PL | 2825 | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | ... | 11.0 | 61.0 | 26 | 1173974.0 | 1876757.0 | 2003.0 | 04/15/2016 08:55:02 AM | 41.817229 | -87.637328 | (41.817229156, -87.637328162) |
| 2 | 2919 | 4789749 | HM402220 | 06/20/2004 11:00:00 AM | 025XX N KIMBALL AVE | 1752 | OFFENSE INVOLVING CHILDREN | AGG CRIM SEX ABUSE FAM MEMBER | RESIDENCE | False | ... | 35.0 | 22.0 | 20 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 3 | 2927 | 4789765 | HM402058 | 12/30/2004 08:00:00 PM | 045XX W MONTANA ST | 0840 | THEFT | FINANCIAL ID THEFT: OVER $300 | OTHER | False | ... | 31.0 | 20.0 | 06 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 4 | 3302 | 4677901 | HM275615 | 05/01/2003 01:00:00 AM | 111XX S NORMAL AVE | 0841 | THEFT | FINANCIAL ID THEFT:$300 &UNDER | RESIDENCE | False | ... | 34.0 | 49.0 | 06 | 1174948.0 | 1831051.0 | 2003.0 | 04/15/2016 08:55:02 AM | 41.691785 | -87.635116 | (41.691784636, -87.635115968) |
5 rows × 23 columns
Three crime types fit the definition of stealing: THEFT, MOTOR VEHICLE THEFT, and BURGLARY.
So we keep only those rows and drop the others.
Crime_data.index = Crime_data['Case Number']
print("Before the drop, it's ", Crime_data.shape)
Crime_data.drop(Crime_data[
(Crime_data['Primary Type'] != "THEFT") &
(Crime_data['Primary Type'] != "MOTOR VEHICLE THEFT") &
(Crime_data['Primary Type'] != 'BURGLARY')
].index, inplace=True, axis=0)
print("After the drop, it's ", Crime_data.shape)
Before the drop, it's (6170473, 23) After the drop, it's (1937260, 23)
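The same filter can be written more idiomatically as one positive selection with Series.isin, instead of chaining negated comparisons. A sketch on a made-up frame (only the 'Primary Type' column matters here):

```python
import pandas as pd

# Made-up stand-in for Crime_data.
toy = pd.DataFrame({"Primary Type": [
    "THEFT", "ASSAULT", "MOTOR VEHICLE THEFT", "BURGLARY", "NARCOTICS",
]})

# isin keeps exactly the three stealing-related types.
theft_types = ["THEFT", "MOTOR VEHICLE THEFT", "BURGLARY"]
kept = toy[toy["Primary Type"].isin(theft_types)]
print(len(kept))  # 3
```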
Run gc.collect() to keep RAM usage low
gc.collect()
0
Turning it into df
df = pd.DataFrame(Crime_data)
There are missing values and null values:
np.any(df.isnull())
True
np.any(df['X Coordinate'] == 0)
True
np.any(df['Y Coordinate'] == 0)
True
So we need to deal with them
#Crime_data[['X Coordinate', 'Y Coordinate']] = Crime_data[['X Coordinate', 'Y Coordinate']].replace(0, np.NaN)
Crime_data[['X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude']] = Crime_data[['X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude']].replace(0, np.NaN)
print("Before we drop NaN: ", Crime_data.shape)
Crime_data.dropna(inplace=True)
df = pd.DataFrame(Crime_data)
print("After we drop NaN: ", Crime_data.shape)
Before we drop NaN: (1937260, 23) After we drop NaN: (1723723, 23)
Crime_data.Date = pd.to_datetime(Crime_data.Date, format="%m/%d/%Y %I:%M:%S %p")
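Worth noting: the format string uses %I (12-hour clock) together with %p, so "12:01:00 AM" parses to hour 0. A quick check on sample strings in the dataset's format:

```python
import pandas as pd

# Two timestamp strings in the dataset's 12-hour format.
s = pd.Series(["01/01/2004 12:01:00 AM", "06/20/2004 11:00:00 AM"])
parsed = pd.to_datetime(s, format="%m/%d/%Y %I:%M:%S %p")
print(parsed.iloc[0])  # 2004-01-01 00:01:00 -- 12:01 AM is hour 0
```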
Crime_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1723723 entries, HM275615 to HZ250453
Data columns (total 23 columns):
 #   Column                Dtype
---  ------                -----
 0   Unnamed: 0            int64
 1   ID                    int64
 2   Case Number           object
 3   Date                  datetime64[ns]
 4   Block                 object
 5   IUCR                  object
 6   Primary Type          object
 7   Description           object
 8   Location Description  object
 9   Arrest                bool
 10  Domestic              bool
 11  Beat                  int64
 12  District              float64
 13  Ward                  float64
 14  Community Area        float64
 15  FBI Code              object
 16  X Coordinate          float64
 17  Y Coordinate          object
 18  Year                  float64
 19  Updated On            object
 20  Latitude              object
 21  Longitude             float64
 22  Location              object
dtypes: bool(2), datetime64[ns](1), float64(6), int64(3), object(11)
memory usage: 292.6+ MB
Crime_data.Latitude = Crime_data.Latitude.astype(float)
Crime_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1723723 entries, HM275615 to HZ250453
Data columns (total 23 columns):
 #   Column                Dtype
---  ------                -----
 0   Unnamed: 0            int64
 1   ID                    int64
 2   Case Number           object
 3   Date                  datetime64[ns]
 4   Block                 object
 5   IUCR                  object
 6   Primary Type          object
 7   Description           object
 8   Location Description  object
 9   Arrest                bool
 10  Domestic              bool
 11  Beat                  int64
 12  District              float64
 13  Ward                  float64
 14  Community Area        float64
 15  FBI Code              object
 16  X Coordinate          float64
 17  Y Coordinate          object
 18  Year                  float64
 19  Updated On            object
 20  Latitude              float64
 21  Longitude             float64
 22  Location              object
dtypes: bool(2), datetime64[ns](1), float64(7), int64(3), object(10)
memory usage: 292.6+ MB
There is something wrong with the results for the year 2017:
EachCountByYear = pd.crosstab(Crime_data['Year'], Crime_data['Primary Type'])
EachCountByYear
| Year | BURGLARY | MOTOR VEHICLE THEFT | THEFT |
|---|---|---|---|
| 2001.0 | 38 | 639 | 1214 |
| 2002.0 | 18921 | 17493 | 71663 |
| 2003.0 | 24990 | 22651 | 97447 |
| 2004.0 | 20698 | 18779 | 79922 |
| 2005.0 | 25414 | 22384 | 84299 |
| 2006.0 | 24302 | 21785 | 85226 |
| 2007.0 | 24825 | 18551 | 84557 |
| 2008.0 | 25995 | 18613 | 86339 |
| 2009.0 | 26483 | 15309 | 79248 |
| 2010.0 | 26307 | 18945 | 76446 |
| 2011.0 | 26591 | 19375 | 74818 |
| 2012.0 | 22833 | 16481 | 75049 |
| 2013.0 | 17825 | 12532 | 71124 |
| 2014.0 | 14310 | 9793 | 60609 |
| 2015.0 | 13086 | 10039 | 56830 |
| 2016.0 | 13612 | 11005 | 58328 |
plt.plot(EachCountByYear)
plt.show()
The reason is that the data only runs through January 18, 2017:
theBeginningOf2017 = datetime(2017, 1, 1)
df[df['Date'] >= theBeginningOf2017].sort_values(['Date'], ascending=False).head()
| | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 23 columns
So, for the sake of convenience, we simply delete any data in 2017. (As the query above shows, no rows from 2017 survived the earlier NaN drop, so the shape stays the same.)
print("Before deleting data in 2017: ", Crime_data.shape)
Crime_data.drop(df[df['Date'] >= theBeginningOf2017].index, inplace=True, axis=0)
df = pd.DataFrame(Crime_data)
print("After deleting data in 2017: ", Crime_data.shape)
Before deleting data in 2017: (1723723, 23) After deleting data in 2017: (1723723, 23)
There is also something wrong with the data in the spring of 2004 and before mid-2002:
EachCountByDay = pd.crosstab(Crime_data['Date'].dt.floor('d'), Crime_data['Primary Type'])
sumOfEachDay = EachCountByDay.sum(axis=1)
percentagePerDay = EachCountByDay.div(sumOfEachDay, axis=0)
plt.figure(figsize=(20, 5))
plt.plot(EachCountByDay)
plt.plot(sumOfEachDay)
plt.ylabel('Crime Count')
plt.legend(['BURGLARY', 'MOTOR VEHICLE THEFT', 'THEFT', 'Sum'])
plt.show()
So we decided to delete them
theBeginningOf2005 = datetime(2005, 1, 1)
print("Before deleting data before 2005: ", Crime_data.shape)
Crime_data.drop(df[df['Date'] < theBeginningOf2005].index, inplace=True, axis=0)
df = pd.DataFrame(Crime_data)
print("After deleting data before 2005: ", Crime_data.shape)
Before deleting data before 2005: (1723723, 23) After deleting data before 2005: (1349268, 23)
gc.collect()
7517
Crime_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1349268 entries, HM275105 to HZ250453
Data columns (total 23 columns):
 #   Column                Non-Null Count    Dtype
---  ------                --------------    -----
 0   Unnamed: 0            1349268 non-null  int64
 1   ID                    1349268 non-null  int64
 2   Case Number           1349268 non-null  object
 3   Date                  1349268 non-null  datetime64[ns]
 4   Block                 1349268 non-null  object
 5   IUCR                  1349268 non-null  object
 6   Primary Type          1349268 non-null  object
 7   Description           1349268 non-null  object
 8   Location Description  1349268 non-null  object
 9   Arrest                1349268 non-null  bool
 10  Domestic              1349268 non-null  bool
 11  Beat                  1349268 non-null  int64
 12  District              1349268 non-null  float64
 13  Ward                  1349268 non-null  float64
 14  Community Area        1349268 non-null  float64
 15  FBI Code              1349268 non-null  object
 16  X Coordinate          1349268 non-null  float64
 17  Y Coordinate          1349268 non-null  object
 18  Year                  1349268 non-null  float64
 19  Updated On            1349268 non-null  object
 20  Latitude              1349268 non-null  float64
 21  Longitude             1349268 non-null  float64
 22  Location              1349268 non-null  object
dtypes: bool(2), datetime64[ns](1), float64(7), int64(3), object(10)
memory usage: 229.0+ MB
countTotal = df.groupby(['Primary Type']).count()['Case Number']
countSum = countTotal.sum()
percentageTotal = countTotal.div(countSum, axis=0)
percentageTotal.plot(kind='pie')
<AxesSubplot:ylabel='Case Number'>
EachCountByYear = pd.crosstab(Crime_data['Year'], Crime_data['Primary Type'])
sumOfEachYear = EachCountByYear.sum(axis=1)
percentagePerYear = EachCountByYear.div(sumOfEachYear, axis=0)
percentagePerYear.plot(kind='bar', stacked=True)
<AxesSubplot:xlabel='Year'>
Plotting these per-day and per-hour breakdowns is too time- and RAM-consuming, so the stacked-bar calls below are left commented out.
dt.floor('d') truncates each timestamp to day resolution (and dt.floor('h') to the hour).
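A quick check of the truncation on a couple of synthetic timestamps:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2005-01-01 14:37:52", "2005-01-01 23:59:59"]))

# floor('d') zeroes the time-of-day; floor('h') keeps the hour but zeroes
# minutes and seconds.
print(ts.dt.floor("d").iloc[0])  # 2005-01-01 00:00:00
print(ts.dt.floor("h").iloc[0])  # 2005-01-01 14:00:00
```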
EachCountByDay = pd.crosstab(Crime_data['Date'].dt.floor('d'), Crime_data['Primary Type'])
sumOfEachDay = EachCountByDay.sum(axis=1)
percentagePerDay = EachCountByDay.div(sumOfEachDay, axis=0)
#percentagePerDay.plot(kind='bar', stacked=True, xticks=[])
<AxesSubplot:xlabel='Date'>
EachCountByHour = pd.crosstab(Crime_data['Date'].dt.floor('h'), Crime_data['Primary Type'])
sumOfEachHour = EachCountByHour.sum(axis=1)
percentagePerHour = EachCountByHour.div(sumOfEachHour, axis=0)
#percentagePerHour.plot(kind='bar', stacked=True, xticks=[])
<AxesSubplot:xlabel='Date'>
EachCountByYear
| Year | BURGLARY | MOTOR VEHICLE THEFT | THEFT |
|---|---|---|---|
| 2005.0 | 25414 | 22384 | 84299 |
| 2006.0 | 24302 | 21785 | 85226 |
| 2007.0 | 24825 | 18551 | 84557 |
| 2008.0 | 25995 | 18613 | 86339 |
| 2009.0 | 26483 | 15309 | 79248 |
| 2010.0 | 26307 | 18945 | 76446 |
| 2011.0 | 26591 | 19375 | 74818 |
| 2012.0 | 22833 | 16481 | 75049 |
| 2013.0 | 17825 | 12532 | 71124 |
| 2014.0 | 14310 | 9793 | 60609 |
| 2015.0 | 13086 | 10039 | 56830 |
| 2016.0 | 13612 | 11005 | 58328 |
plt.plot(EachCountByYear)
plt.plot(sumOfEachYear)
plt.ylabel('Crime Count')
plt.legend(['BURGLARY', 'MOTOR VEHICLE THEFT', 'THEFT', 'Sum'])
plt.show()
EachCountByDay
| Date | BURGLARY | MOTOR VEHICLE THEFT | THEFT |
|---|---|---|---|
| 2005-01-01 | 48 | 60 | 291 |
| 2005-01-02 | 43 | 55 | 191 |
| 2005-01-03 | 61 | 53 | 194 |
| 2005-01-04 | 69 | 55 | 201 |
| 2005-01-05 | 66 | 57 | 182 |
| ... | ... | ... | ... |
| 2016-12-27 | 44 | 31 | 138 |
| 2016-12-28 | 48 | 33 | 164 |
| 2016-12-29 | 36 | 38 | 131 |
| 2016-12-30 | 42 | 43 | 165 |
| 2016-12-31 | 35 | 39 | 148 |
4383 rows × 3 columns
plt.figure(figsize=(20, 5))
plt.plot(EachCountByDay)
plt.plot(sumOfEachDay)
plt.ylabel('Crime Count')
plt.legend(['BURGLARY', 'MOTOR VEHICLE THEFT', 'THEFT', 'Sum'])
plt.show()
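One way to make the annual cycle stand out from the daily noise is to average the daily counts per month with resample. A sketch on synthetic daily data (the sinusoid is invented, standing in for sumOfEachDay):

```python
import numpy as np
import pandas as pd

# Invented daily series with a yearly sinusoid, standing in for sumOfEachDay.
idx = pd.date_range("2005-01-01", "2006-12-31", freq="D")
daily = pd.Series(200 + 50 * np.sin(2 * np.pi * idx.dayofyear / 365.0), index=idx)

# 'MS' (month start) buckets: averaging each month's days smooths the
# day-to-day noise so the seasonal swing is easier to see.
monthly = daily.resample("MS").mean()
print(len(monthly))  # 24 monthly points for two years
```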
There seems to be an annual pattern.
Breaking the daily counts out year by year:
fig, axes = plt.subplots(nrows=12, ncols=1, figsize=(20, 100))
EachCountByDayDataFrame = pd.DataFrame(EachCountByDay)
EachCountByDayDataFrame = EachCountByDayDataFrame.reset_index(inplace=False)
for year in range(2005, 2017):
i = year - 2005
theBeginningOfTheYear = datetime(year, 1, 1)
theEndOfTheYear = datetime(year + 1, 1, 1)
EachCountByDayDataFrameTheYear = EachCountByDayDataFrame[
(EachCountByDayDataFrame['Date'] >= theBeginningOfTheYear) &
(EachCountByDayDataFrame['Date'] < theEndOfTheYear)
].set_index(['Date'])
sumOfEachCountByDayDataFrameTheYear = EachCountByDayDataFrameTheYear.sum(axis=1)
axes[i].set_title("Data " + str(year) + " Each Day")
axes[i].plot(EachCountByDayDataFrameTheYear)
axes[i].plot(sumOfEachCountByDayDataFrameTheYear)
axes[i].legend(['BURGLARY', 'MOTOR VEHICLE THEFT', 'THEFT', 'Sum'])
for ax in axes:
ax.set_xticks([])
plt.show()
There seem to be no monthly patterns.
Zooming in month by month:
# Only 2005 is plotted in this loop, so a 6x2 grid (one panel per month) suffices
fig, axes = plt.subplots(nrows=6, ncols=2, figsize=(20, 35))
for year in range(2005, 2006):
for month in range(1, 13):
i = int(((year - 2005) * 12 + month - 1) / 2)#aka row number
j = int((year - 2005) * 12 + month + 1) % 2 #aka column number
newMonth = (month + 1) if (month + 1) <= 12 else 1
theBeginningOfTheMonth = datetime(year, month, 1)
theEndOfTheMonth = datetime(year + int(month / 12), newMonth, 1)
EachCountByDayDataFrameTheMonth = EachCountByDayDataFrame[
(EachCountByDayDataFrame['Date'] >= theBeginningOfTheMonth) &
(EachCountByDayDataFrame['Date'] < theEndOfTheMonth)
].set_index(['Date'])
sumOfEachCountByDayDataFrameTheMonth = EachCountByDayDataFrameTheMonth.sum(axis=1)
axes[i][j].set_title("Data " + str(year) + "." + str(month))
axes[i][j].plot(EachCountByDayDataFrameTheMonth)
axes[i][j].plot(sumOfEachCountByDayDataFrameTheMonth)
axes[i][j].legend(['BURGLARY', 'MOTOR VEHICLE THEFT', 'THEFT', 'Sum'])
axes[i][j].grid()
plt.show()
EachCountByHour
| Date | BURGLARY | MOTOR VEHICLE THEFT | THEFT |
|---|---|---|---|
| 2005-01-01 00:00:00 | 6 | 3 | 82 |
| 2005-01-01 01:00:00 | 4 | 5 | 8 |
| 2005-01-01 02:00:00 | 0 | 2 | 13 |
| 2005-01-01 03:00:00 | 3 | 5 | 5 |
| 2005-01-01 04:00:00 | 2 | 3 | 6 |
| ... | ... | ... | ... |
| 2016-12-31 19:00:00 | 4 | 1 | 1 |
| 2016-12-31 20:00:00 | 3 | 3 | 3 |
| 2016-12-31 21:00:00 | 0 | 1 | 7 |
| 2016-12-31 22:00:00 | 3 | 3 | 4 |
| 2016-12-31 23:00:00 | 3 | 2 | 9 |
104719 rows × 3 columns
EachCountByHourDataFrame = pd.DataFrame(EachCountByHour)
EachCountByHourDataFrame = EachCountByHourDataFrame.reset_index(inplace=False)
fig, axes = plt.subplots(nrows=72, ncols=2, figsize=(20, 400))
for year in range(2005, 2017):
for month in range(1, 13):
i = int(((year - 2005) * 12 + month - 1) / 2)#aka row number
j = int((year - 2005) * 12 + month + 1) % 2 #aka column number
newMonth = (month + 1) if (month + 1) <= 12 else 1
theBeginningOfTheMonth = datetime(year, month, 1)
theEndOfTheMonth = datetime(year + int(month / 12), newMonth, 1)
EachCountByHourDataFrameTheMonth = EachCountByHourDataFrame[
(EachCountByHourDataFrame['Date'] >= theBeginningOfTheMonth) &
(EachCountByHourDataFrame['Date'] < theEndOfTheMonth)
].set_index(['Date'])
sumOfEachCountByHourDataFrameTheMonth = EachCountByHourDataFrameTheMonth.sum(axis=1)
axes[i][j].set_title("Data " + str(year) + "." + str(month))
axes[i][j].plot(EachCountByHourDataFrameTheMonth)
axes[i][j].plot(sumOfEachCountByHourDataFrameTheMonth)
axes[i][j].legend(['BURGLARY', 'MOTOR VEHICLE THEFT', 'THEFT', 'Sum'])
plt.show()